For this exercise, we will use the same subset of the data from the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany as in the presentation. Run the following code to go through the wrangling pipeline. Remember that the .csv file should be stored in the data folder within the directory that contains the course materials.
library(tidyverse)
library(naniar)
gesis_panel_corona <- read_csv2("../../../data/ZA5667_v1-1-0.csv")
missings <- c(-111, -99, -77, -33, -22)
corona_survey <- gesis_panel_corona %>%
  # keep (and partly rename) the variables needed for the exercises
  select(id,
         sex:education_cat,
         choice_of_party,
         left_right = political_orientation,
         risk_self = hzcy001a,
         risk_surround = hzcy002a,
         avoid_places = hzcy006a,
         keep_distance = hzcy007a,
         wash_hands = hzcy011a,
         stockup_supplies = hzcy013a,
         reduce_contacts = hzcy014a,
         wear_mask = hzcy015a,
         trust_rki = hzcy047a,
         trust_government = hzcy048a,
         trust_chancellor = hzcy049a,
         trust_who = hzcy051a,
         trust_scientists = hzcy052a,
         info_national_public_tv = hzcy084a,
         info_national_newspaper = hzcy086a,
         info_local_newspaper = hzcy089a,
         info_facebook = hzcy090a,
         info_other_social_media = hzcy091a) %>%
  # set the codes listed in `missings` to NA in all columns
  replace_with_na_all(condition = ~.x %in% missings) %>%
  # set additional, variable-specific codes to NA
  replace_with_na(replace = list(choice_of_party = c(97, 98),
                                 risk_self = c(97),
                                 risk_surround = c(97),
                                 trust_rki = c(98),
                                 trust_government = c(98),
                                 trust_chancellor = c(98),
                                 trust_who = c(98),
                                 trust_scientists = c(98))) %>%
  # turn the numeric codes of the categorical variables into factors
  mutate(sex = recode_factor(sex,
                             `1` = "Male",
                             `2` = "Female"),
         education_cat = recode_factor(education_cat,
                                       `1` = "Low",
                                       `2` = "Medium",
                                       `3` = "High",
                                       .ordered = TRUE),
         age_cat = recode_factor(age_cat,
                                 `1` = "<= 25 years",
                                 `2` = "26 to 30 years",
                                 `3` = "31 to 35 years",
                                 `4` = "36 to 40 years",
                                 `5` = "41 to 45 years",
                                 `6` = "46 to 50 years",
                                 `7` = "51 to 60 years",
                                 `8` = "61 to 65 years",
                                 `9` = "66 to 70 years",
                                 `10` = ">= 71 years",
                                 .ordered = TRUE),
         choice_of_party = recode_factor(choice_of_party,
                                         `1` = "CDU/CSU",
                                         `2` = "SPD",
                                         `3` = "FDP",
                                         `4` = "Linke",
                                         `5` = "Gruene",
                                         `6` = "AfD",
                                         `7` = "Other")
  ) %>%
  # sum scores; an NA in any component makes the whole sum NA
  mutate(sum_measures = avoid_places +
           keep_distance +
           wash_hands +
           stockup_supplies +
           reduce_contacts +
           wear_mask,
         sum_sources = info_national_public_tv +
           info_national_newspaper +
           info_local_newspaper +
           info_facebook +
           info_other_social_media) %>%
  # row-wise mean of the trust items, ignoring missing values
  rowwise() %>%
  mutate(mean_trust = mean(c(trust_rki,
                             trust_government,
                             trust_chancellor,
                             trust_who,
                             trust_scientists),
                           na.rm = TRUE)) %>%
  ungroup()
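The two kinds of scores created at the end of the pipeline treat missing values differently: adding columns with `+` returns NA as soon as one component is missing, whereas `mean(..., na.rm = TRUE)` averages over the items that are available. The following is a minimal base R sketch of that difference; the data frame `toy` and its columns are made-up illustrations and not part of the survey data.

```r
# Toy data: three respondents, two items each (hypothetical values)
toy <- data.frame(item_a = c(1, 0, NA),
                  item_b = c(1, NA, NA))

# Sum score via `+`: an NA in any component propagates to the result
toy$sum_score <- toy$item_a + toy$item_b

# Mean score that ignores missing items, analogous to
# mean(c(...), na.rm = TRUE) inside rowwise()
toy$mean_score <- rowMeans(toy[, c("item_a", "item_b")], na.rm = TRUE)

print(toy)
```

Note that a respondent with no valid item at all gets NaN (not NA) as the mean when `na.rm = TRUE` is used.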
As we will use the same dataset again in the next exercise in this session, it makes sense to save it. To preserve the information about the variable types, it is best to save it as a .rds file. You can do this with the following command:
saveRDS(corona_survey, "../data/gp_corona_subset.rds")
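To illustrate why the .rds format is preferable here, the following sketch compares a round trip through .rds with one through .csv. It uses a made-up data frame (`toy`) and temporary files, so nothing is written to the course folders; only base R functions are used.

```r
# Toy data frame with an ordered factor (hypothetical values)
toy <- data.frame(id = 1:3,
                  education = factor(c("Low", "High", "Medium"),
                                     levels = c("Low", "Medium", "High"),
                                     ordered = TRUE))

# Temporary files so that the sketch leaves no traces
rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")

saveRDS(toy, rds_file)
write.csv(toy, csv_file, row.names = FALSE)

from_rds <- readRDS(rds_file)
from_csv <- read.csv(csv_file)

class(from_rds$education)  # the ordered factor is preserved
class(from_csv$education)  # the ordering information is lost
```

After reading the .csv file, the factor levels and their order would have to be restored by hand, which is exactly the work the wrangling pipeline has already done.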
If you have not done so yet, please also install the summarytools and GGally packages. The following code chunk checks whether these packages are installed and installs them if that is not the case.
if (!require(summarytools)) install.packages("summarytools")
if (!require(GGally)) install.packages("GGally")
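A possible variation on this pattern is sketched below. It assumes that you only want to check whether packages are available without attaching them, which `requireNamespace()` does. The helper `missing_packages()` is a hypothetical name introduced for this illustration, and the installation step is commented out so that running the sketch has no side effects.

```r
# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("summarytools", "GGally"))
print(to_install)

# Install only what is missing (uncomment to actually install)
# if (length(to_install) > 0) install.packages(to_install)
```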
Print a simple frequency table for the variable education_cat. Also include the counts for missing values.
You can use the table() function from base R for this.
table(corona_survey$age_cat, useNA = "always")
##
##    <= 25 years 26 to 30 years 31 to 35 years 36 to 40 years 41 to 45 years
##            107            267            276            328            317
## 46 to 50 years 51 to 60 years 61 to 65 years 66 to 70 years    >= 71 years
##            367            978            386            357            382
##           <NA>
##              0
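The call above tabulates age_cat; the same pattern with `corona_survey$education_cat` gives the table for the variable named in the task. To show what the `useNA` argument does, here is a small sketch with a made-up factor `x`:

```r
# Toy factor with one missing value (hypothetical values)
x <- factor(c("Low", "High", NA, "Medium", "High"),
            levels = c("Low", "Medium", "High"))

table(x)                    # default: missing values are dropped
table(x, useNA = "ifany")   # NA category only if missing values occur
table(x, useNA = "always")  # NA category always shown, even with a count of 0

# Stored for inspection: counts including the NA category
tab_always <- table(x, useNA = "always")
```

With `useNA = "always"`, the table ends with an `<NA>` entry even if there are no missing values, as in the output for age_cat.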
Use a function from the summarytools package to get summary statistics for the following variables in your dataset: left_right, sum_measures, mean_trust.
You can combine select() from the dplyr package with descr() from summarytools.
library(summarytools)
corona_survey %>%
  select(left_right,
         sum_measures,
         mean_trust) %>%
  descr()
## Descriptive Statistics
## corona_survey
## N: 3765
##
##                     left_right   mean_trust   sum_measures
## ----------------- ------------ ------------ --------------
##              Mean         4.66         3.98           3.77
##           Std.Dev         1.86         0.75           1.16
##               Min         0.00         1.00           0.00
##                Q1         3.00         3.60           3.00
##            Median         5.00         4.00           4.00
##                Q3         6.00         4.60           5.00
##               Max        10.00         5.00           6.00
##               MAD         1.48         0.59           1.48
##               IQR         3.00         1.00           2.00
##                CV         0.40         0.19           0.31
##          Skewness        -0.10        -0.94          -1.14
##       SE.Skewness         0.04         0.04           0.04
##          Kurtosis        -0.16         1.01           1.43
##           N.Valid      3678.00      3157.00        3186.00
##         Pct.Valid        97.69        83.85          84.62
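Some rows of this output are less familiar than the mean or the median. The following base R sketch computes a few of them for a made-up vector `x`. It assumes that CV is the standard deviation divided by the mean, that MAD is the scaled median absolute deviation as returned by `mad()`, and that Pct.Valid is the share of non-missing values; these assumptions are consistent with the numbers printed above (for example, 1.86 / 4.66 ≈ 0.40 for left_right), but the sketch is not taken from the summarytools source code.

```r
# Toy vector with one missing value (hypothetical values)
x <- c(2, 4, 4, 5, 7, NA)

n_valid   <- sum(!is.na(x))                               # number of non-missing values
pct_valid <- 100 * n_valid / length(x)                    # share of non-missing values
cv        <- sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)  # coefficient of variation
mad_x     <- mad(x, na.rm = TRUE)                         # scaled median absolute deviation
iqr_x     <- IQR(x, na.rm = TRUE)                         # interquartile range

round(c(N.Valid = n_valid, Pct.Valid = pct_valid,
        CV = cv, MAD = mad_x, IQR = iqr_x), 2)
```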
Use a function from summarytools to display the counts and frequencies for the categories in the age_cat variable.
You can use freq() for this.
freq(corona_survey$age_cat)
## Frequencies
## corona_survey$age_cat
## Type: Ordered Factor
##
##                        Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## -------------------- ------ --------- -------------- --------- --------------
##          <= 25 years    107      2.84           2.84      2.84           2.84
##       26 to 30 years    267      7.09           9.93      7.09           9.93
##       31 to 35 years    276      7.33          17.26      7.33          17.26
##       36 to 40 years    328      8.71          25.98      8.71          25.98
##       41 to 45 years    317      8.42          34.40      8.42          34.40
##       46 to 50 years    367      9.75          44.14      9.75          44.14
##       51 to 60 years    978     25.98          70.12     25.98          70.12
##       61 to 65 years    386     10.25          80.37     10.25          80.37
##       66 to 70 years    357      9.48          89.85      9.48          89.85
##          >= 71 years    382     10.15         100.00     10.15         100.00
##                 <NA>      0                               0.00         100.00
##                Total   3765    100.00         100.00    100.00         100.00
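The percentage columns can be recomputed by hand from the counts, which helps when reading the table. The sketch below copies the counts from the output above into a named vector and derives the percentages and cumulative percentages with base R. Because age_cat has no missing values here, the "valid" and "total" percentages coincide.

```r
# Counts copied from the frequency table above
counts <- c("<= 25 years"    = 107,
            "26 to 30 years" = 267,
            "31 to 35 years" = 276,
            "36 to 40 years" = 328,
            "41 to 45 years" = 317,
            "46 to 50 years" = 367,
            "51 to 60 years" = 978,
            "61 to 65 years" = 386,
            "66 to 70 years" = 357,
            ">= 71 years"    = 382)

pct     <- 100 * counts / sum(counts)  # percentage per category
pct_cum <- cumsum(pct)                 # cumulative percentage

data.frame(Freq = counts,
           Pct = round(pct, 2),
           Pct.Cum = round(pct_cum, 2))
```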
Use a summarytools function to create a crosstable for the variables sex and education_cat.
You can use ctable() for this.
ctable(corona_survey$sex, corona_survey$education_cat)
## Cross-Tabulation, Row Proportions
## sex * education_cat
## Data Frame: corona_survey
##
## -------- --------------- ------------- -------------- -------------- ---------------
##            education_cat           Low         Medium           High           Total
##      sex
##     Male                   255 (13.2%)    526 (27.2%)   1152 (59.6%)   1933 (100.0%)
##   Female                   168 ( 9.2%)    628 (34.3%)   1036 (56.6%)   1832 (100.0%)
##    Total                   423 (11.2%)   1154 (30.7%)   2188 (58.1%)   3765 (100.0%)
## -------- --------------- ------------- -------------- -------------- ---------------
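The percentages in this table are row proportions, meaning that each row sums to 100%. As a cross-check, the following base R sketch rebuilds them from the counts printed above using `prop.table()`. The matrix `counts` is typed in by hand from that output.

```r
# Counts copied from the cross-tabulation above
counts <- matrix(c(255, 526, 1152,
                   168, 628, 1036),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("Male", "Female"),
                                 education_cat = c("Low", "Medium", "High")))

# Row proportions: shares of education levels within each sex
row_pct <- 100 * prop.table(counts, margin = 1)
round(row_pct, 1)

# Column proportions, for comparison: shares of sexes within each education level
col_pct <- 100 * prop.table(counts, margin = 2)
round(col_pct, 1)
```

Which kind of proportion is appropriate depends on the question: row proportions compare the education distribution between men and women, while column proportions compare the sex distribution between education levels.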
Use the correlation package to calculate and print correlations between the following variables: risk_self, risk_surround, sum_measures, sum_sources.
You can combine select() from dplyr and correlation() from the package with the same name.
library(correlation)
corona_survey %>%
  select(risk_self,
         risk_surround,
         sum_measures,
         sum_sources) %>%
  correlation()
## Parameter1    |    Parameter2 |    r |       95% CI |     t |   df |      p |  Method | n_Obs
## ---------------------------------------------------------------------------------------------
## risk_self     | risk_surround | 0.76 | [0.75, 0.78] | 65.29 | 3075 | < .001 | Pearson |  3077
## risk_self     |  sum_measures | 0.16 | [0.13, 0.20] |  9.29 | 3146 | < .001 | Pearson |  3148
## risk_self     |   sum_sources | 0.06 | [0.03, 0.10] |  3.62 | 3129 | < .001 | Pearson |  3131
## risk_surround |  sum_measures | 0.14 | [0.11, 0.17] |  7.89 | 3098 | < .001 | Pearson |  3100
## risk_surround |   sum_sources | 0.09 | [0.06, 0.13] |  5.06 | 3081 | < .001 | Pearson |  3083
## sum_measures  |   sum_sources | 0.13 | [0.09, 0.16] |  7.16 | 3166 | < .001 | Pearson |  3168
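The n_Obs column differs between pairs because each correlation is based on the cases that have valid values for both variables, and df equals n_Obs minus 2 in every row. The following base R sketch reproduces this logic with a made-up data frame `toy`. It uses `cor()` and `cor.test()` rather than the correlation package, so it illustrates the idea and is not a description of that package's internals.

```r
# Toy data with missing values in different rows (hypothetical values)
toy <- data.frame(a = c(1, 2, 3, 4, 5, NA),
                  b = c(2, 1, 4, 3, NA, 6),
                  c = c(5, 3, NA, 2, 1, 4))

# Number of complete pairs for each combination of variables
obs <- (!is.na(as.matrix(toy))) * 1
n_pairs <- crossprod(obs)
n_pairs

# Pearson correlations based on pairwise complete observations
r_mat <- cor(toy, use = "pairwise.complete.obs")
round(r_mat, 2)

# Test for a single pair; incomplete pairs are dropped and df = n - 2
test_ab <- cor.test(toy$a, toy$b)
test_ab$parameter
```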
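Use a function from the GGally package to plot the correlations between the same four variables. The plot should include the coefficients rounded to two decimal places as labels.
You can use ggcorr() for this.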
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from
##   +.gg   ggplot2
corona_survey %>%
  select(risk_self,
         risk_surround,
         sum_measures,
         sum_sources) %>%
  ggcorr(label = TRUE,
         label_round = 2)